Collaborative Filtering via Group-Structured 

Dictionary Learning 

Zoltan Szabo*, Barnabas Poczos†, and Andras Lorincz*
*Faculty of Informatics, Eotvos Lorand University, Pazmany Peter setany 1/C, H-1117 Budapest, Hungary
Email: szzoli@cs.elte.hu, andras.lorincz@elte.hu, Web: http://nipg.inf.elte.hu
†Carnegie Mellon University, Robotics Institute, 5000 Forbes Ave, Pittsburgh, PA 15213
Email: bapoczos@cs.cmu.edu, Web: http://www.autonlab.org 



Abstract — Structured sparse coding and the related structured 
dictionary learning problems are novel research areas in machine 
learning. In this paper we present a new application of structured 
dictionary learning for collaborative filtering based recommender 
systems. Our extensive numerical experiments demonstrate that 
the presented technique outperforms its state-of-the-art competi- 
tors and has several advantages over approaches that do not put 
structured constraints on the dictionary elements.

I. Introduction

The proliferation of online services and the thriving elec- 
tronic commerce overwhelms us with alternatives in our daily 
lives. To handle this information overload and to help users 
in efficient decision making, recommender systems (RS) have 
been designed. The goal of RSs is to recommend personalized 
items for online users when they need to choose among several 
items. Typical problems include recommendations for which 
movie to watch, which jokes/books/news to read, which hotel 
to stay at, or which songs to listen to. 

One of the most popular approaches in the field of recom- 
mender systems is collaborative filtering (CF). The underlying 
idea of CF is very simple: Users generally express their tastes 
in an explicit way by rating the items. CF tries to estimate 
the users' preferences based on the ratings they have already 
made on items and based on the ratings of other, similar users. 
For a recent review on recommender systems and collaborative 
filtering, see e.g., [1]. 

Novel advances on CF show that dictionary learning based 
approaches can be efficient for making predictions about 
users' preferences [2]. The dictionary learning based approach 
assumes that (i) there is a latent, unstructured feature space 
(hidden representation) behind the users' ratings, and (ii) a 
rating of an item is equal to the product of the item and the 
user's feature. To increase the generalization capability,
$\ell_2$ regularization is usually introduced both for the
dictionary and for the users' representation.

There are several problems that belong to the task of dictionary
learning [3], a.k.a. matrix factorization [4]. This set of
problems includes, for example, (sparse) principal component
analysis [5], independent component analysis [6], independent
subspace analysis [7], non-negative matrix factorization [8],
and structured dictionary learning, which will be the target of
our paper.

A compressed version of the paper has been accepted for publication at
the 10th International Conference on Latent Variable Analysis and Source
Separation (LVA/ICA 2012).

One predecessor of the structured dictionary learning prob- 
lem is the sparse coding task [9], which is a considerably 
simpler problem. Here the dictionary is already given, and we 
assume that the observations can be approximated well enough 
using only a few dictionary elements. Although finding the 
solution that uses the minimal number of dictionary elements 
is NP hard in general [10], there exist efficient approximations. 
One prominent example is the Lasso approach [11], which
applies convex $\ell_1$ relaxation to the code words. Lasso does
not enforce any group structure on the components of the
representation (covariates).

However, using structured sparsity, that is, forcing different
kinds of structures (e.g., disjoint groups, trees) on the sparse
codes can lead to increased performance in several applications.
Indeed, as it has been theoretically proved recently,
structured sparsity can ease feature selection [12], [13], and
makes possible robust compressed sensing with a substantially
decreased number of observations [14]. Many other real life
applications also confirm the benefits of structured sparsity,
for example (i) automatic image annotation [15], (ii)
group-structured feature selection for micro array data processing
[16]-[19], (iii) multi-task learning problems (a.k.a. transfer
learning) [20]-[22], (iv) multiple kernel learning [23], [24], (v)
face recognition [25], and (vi) structure learning in graphical
models [26], [27]. For an excellent review on structured
sparsity, see [28].

All the above mentioned examples only consider the struc- 
tured sparse coding problem, where we assume that the 
dictionary is already given and available to us. A more 
interesting (and challenging) problem is the combination of 
these two tasks, i.e., learning the best structured dictionary 
and structured representation. This is the structured dictionary 
learning (SDL) problem. SDL is more difficult; one can find
only a few solutions in the literature [29]-[34]. This novel field
is appealing for (i) transformation invariant feature extraction
[33], (ii) image denoising/inpainting [29], [31], [34], (iii)
background subtraction [31], (iv) analysis of text corpora [29],
and (v) face recognition [30].

Our goal is to extend the application domain of SDL in
the direction of collaborative filtering. With respect to CF,
further constraints appear for SDL since (i) online learning
is desired and (ii) missing information is typical. There are
good reasons for them: novel items/users may appear and user
preferences may change over time. Adaptation to users also
motivates online methods. Online methods have the additional
advantage with respect to offline ones that they can process 
more instances in the same amount of time, and in many cases 
this can lead to increased performance. For a theoretical proof 
of this claim, see [35]. Furthermore, users can evaluate only a 
small portion of the available items, which leads to incomplete 
observations, missing rating values. In order to cope with these 
constraints of the collaborative filtering problem, we will use 
a novel extension of the structured dictionary learning prob- 
lem, the so-called online group-structured dictionary learning 
(OSDL) [36]. OSDL allows (i) overlapping group structures 
with (ii) non-convex sparsity inducing regularization, (iii) 
partial observation (iv) in an online framework. 

Our paper is structured as follows: We briefly review the 
OSDL problem, its cost function, and optimization method 
in Section II. We cast the CF problem as an OSDL task in 
Section III. Numerical results are presented in Section IV. 
Conclusions are drawn in Section V. 

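Notations. Vectors ($\mathbf{a}$) and matrices ($\mathbf{A}$) are
denoted by bold letters. $\mathrm{diag}(\mathbf{a})$ represents the
diagonal matrix with the coordinates of vector $\mathbf{a}$ in its
diagonal. The $i$th coordinate of vector $\mathbf{a}$ is $a_i$.
Notation $|\cdot|$ means the number of elements of a set and the
absolute value for a real number. For set $O \subseteq \{1, \ldots, d\}$,
$\mathbf{a}_O \in \mathbb{R}^{|O|}$ denotes the coordinates of vector
$\mathbf{a} \in \mathbb{R}^d$ in $O$. For matrix
$\mathbf{A} \in \mathbb{R}^{d \times D}$,
$\mathbf{A}_O \in \mathbb{R}^{|O| \times D}$ stands for the restriction
of matrix $\mathbf{A}$ to the rows $O$. $\mathbf{I}$ and $\mathbf{0}$
denote the identity and the null matrices, respectively.
$\mathbf{A}^T$ is the transposed form of $\mathbf{A}$. For a vector,
the max operator acts coordinate-wise. The $\ell_p$ (quasi-)norm of
vector $\mathbf{a} \in \mathbb{R}^d$ is
$\|\mathbf{a}\|_p = (\sum_{i=1}^d |a_i|^p)^{1/p}$ ($p > 0$).
$S_p^d = \{\mathbf{a} \in \mathbb{R}^d : \|\mathbf{a}\|_p \le 1\}$
denotes the $\ell_p$ unit sphere in $\mathbb{R}^d$. The point-wise and
scalar products of $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ are
denoted by $\mathbf{a} \circ \mathbf{b} = [a_1 b_1; \ldots; a_d b_d]$
and by $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^T \mathbf{b}$,
respectively. For a set system $\mathcal{G}$, the coordinates of vector
$\mathbf{a} \in \mathbb{R}^{|\mathcal{G}|}$ are denoted by $a^G$
($G \in \mathcal{G}$), that is, $\mathbf{a} = (a^G)_{G \in \mathcal{G}}$.
$\Pi_C(\mathbf{x}) = \mathrm{argmin}_{\mathbf{c} \in C} \|\mathbf{x} - \mathbf{c}\|_2$
is the projection of point $\mathbf{x} \in \mathbb{R}^d$ to the convex
closed set $C \subseteq \mathbb{R}^d$. Partial derivative of function
$h$ w.r.t. variable $\mathbf{x}$ in $\mathbf{x}_0$ is
$\frac{\partial h}{\partial \mathbf{x}}(\mathbf{x}_0)$. The non-negative
orthant of $\mathbb{R}^d$ is
$\mathbb{R}_+^d = \{\mathbf{x} \in \mathbb{R}^d : x_i \ge 0 \ (\forall i)\}$.
For sets, $\times$ and $\backslash$ denote direct product and
difference, respectively.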
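As a reading aid, the notation above can be mirrored in a few lines of plain Python. This is only an illustrative sketch: the helper names (`lp_norm`, `pointwise`, `restrict`, `project_unit_ball`) are hypothetical and not taken from the paper, indices are 0-based, and the projection is shown only for the Euclidean unit ball $S_2^d$.

```python
def lp_norm(a, p):
    """l_p (quasi-)norm (sum_i |a_i|^p)^(1/p), defined for p > 0."""
    return sum(abs(v) ** p for v in a) ** (1.0 / p)


def pointwise(a, b):
    """Point-wise product a o b = [a_1 b_1; ...; a_d b_d]."""
    return [x * y for x, y in zip(a, b)]


def restrict(a, O):
    """Coordinates of vector a in the index set O (0-based indices)."""
    return [a[i] for i in sorted(O)]


def project_unit_ball(x):
    """Euclidean projection of x onto {c : ||c||_2 <= 1}."""
    n = lp_norm(x, 2)
    return list(x) if n <= 1.0 else [v / n for v in x]
```

For example, `lp_norm([1, 1], 0.5)` evaluates the quasi-norm with $p = 0.5$, which is no longer convex; this is the regime ($\eta < 1$) in which the regularizer of Section II-A is used.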

II. The OSDL Problem 

In this section we briefly review the OSDL approach, which 
will be our major tool to solve the CF problem. The OSDL 
cost function is treated in Section II-A, its optimization idea 
is detailed in Section II-B. 

A. Cost Function 

The online group-structured dictionary learning (OSDL)
task is defined with the following quantities. Let the dimension
of the observations be denoted by $d_x$. Assume that in each
time instant ($i = 1, 2, \ldots$) a set $O_i \subseteq \{1, \ldots, d_x\}$
is given, that is, we know which coordinates are observable at time $i$,
and the observation is $\mathbf{x}_{O_i}$. Our goal is to find a
dictionary $\mathbf{D} \in \mathbb{R}^{d_x \times d_\alpha}$ that can
approximate the observations $\mathbf{x}_{O_i}$ well from the linear
combination of its columns. The columns of $\mathbf{D}$ are assumed to
belong to a closed, convex, and bounded set
$\mathcal{D} = \times_{i=1}^{d_\alpha} \mathcal{D}_i$. To formulate the
cost of dictionary $\mathbf{D}$, first a fixed time instant $i$,
observation $\mathbf{x}_{O_i}$, dictionary $\mathbf{D}$ is considered,
and the hidden representation $\boldsymbol{\alpha}_i$ associated to this
$(\mathbf{x}_{O_i}, \mathbf{D}, O_i)$ triple is defined. Representation
$\boldsymbol{\alpha}_i$ is allowed to belong to a closed, convex set
$\mathcal{A} \subseteq \mathbb{R}^{d_\alpha}$
($\boldsymbol{\alpha}_i \in \mathcal{A}$) with certain structural
constraints. The structural constraints on $\boldsymbol{\alpha}_i$
are expressed by making use of a given $\mathcal{G}$ group structure,
which is a set system (also called hypergraph) on
$\{1, \ldots, d_\alpha\}$. It is also assumed that weight vectors
$\mathbf{d}^G \in \mathbb{R}^{d_\alpha}$ ($G \in \mathcal{G}$) are
available for us and that they are positive on $G$ and 0 otherwise.
Representation $\boldsymbol{\alpha}$ belonging to a triple
$(\mathbf{x}_O, \mathbf{D}, O)$
is defined as the solution of the structured sparse coding task 



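$$l(\mathbf{x}_O, \mathbf{D}_O) = l_{\kappa, \mathcal{G}, \{\mathbf{d}^G\}_{G \in \mathcal{G}}, \eta}(\mathbf{x}_O, \mathbf{D}_O) \tag{1}$$

$$= \min_{\boldsymbol{\alpha} \in \mathcal{A}} \left[ \frac{1}{2} \|\mathbf{x}_O - \mathbf{D}_O \boldsymbol{\alpha}\|_2^2 + \kappa \Omega(\boldsymbol{\alpha}) \right], \tag{2}$$

where $l(\mathbf{x}_O, \mathbf{D}_O)$ denotes the loss, $\kappa > 0$, and

$$\Omega(\mathbf{y}) = \Omega_{\mathcal{G}, \{\mathbf{d}^G\}_{G \in \mathcal{G}}, \eta}(\mathbf{y}) = \left\| \left( \|\mathbf{d}^G \circ \mathbf{y}\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta \tag{3}$$

is the structured regularizer associated to $\mathcal{G}$ and
$\{\mathbf{d}^G\}_{G \in \mathcal{G}}$, $\eta \in (0, 2)$. Here, the
first term of (2) is responsible for the quality of approximation on
the observed coordinates, whereas for $\eta \le 1$ the other term [(3)]
constrains the solution according to the group structure $\mathcal{G}$
similarly to the sparsity inducing regularizer in [30]: it eliminates
the terms $\|\mathbf{d}^G \circ \mathbf{y}\|_2$ ($G \in \mathcal{G}$)
by means of $\|\cdot\|_\eta$. The OSDL problem is defined as the
minimization of the cost function:

$$\min_{\mathbf{D} \in \mathcal{D}} f_t(\mathbf{D}) := \frac{1}{\sum_{j=1}^t (j/t)^\rho} \sum_{i=1}^t \left( \frac{i}{t} \right)^\rho l(\mathbf{x}_{O_i}, \mathbf{D}_{O_i}), \tag{4}$$

that is, the goal is to minimize the average loss belonging to
the dictionary, where $\rho$ is a non-negative forgetting factor. If
$\rho = 0$, the classical average
$f_t(\mathbf{D}) = \frac{1}{t} \sum_{i=1}^t l(\mathbf{x}_{O_i}, \mathbf{D}_{O_i})$
is recovered.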
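The structured regularizer (3) and the forgetting-factor weights of (4) are simple to evaluate directly. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function names `omega` and `loss_weights` are hypothetical, groups are given as lists of 0-based indices, and each weight vector $\mathbf{d}^G$ is a full-length list that is only read on the coordinates of its group.

```python
def omega(y, groups, weights, eta):
    """Structured regularizer (3): the eta-(quasi-)norm of the vector of
    group-wise Euclidean norms ||d^G o y||_2, one entry per group G."""
    group_norms = []
    for G, dG in zip(groups, weights):
        # d^G vanishes outside G, so summing over G is sufficient.
        group_norms.append(sum((dG[j] * y[j]) ** 2 for j in G) ** 0.5)
    return sum(n ** eta for n in group_norms) ** (1.0 / eta)


def loss_weights(t, rho):
    """Normalized weights (i/t)^rho / sum_j (j/t)^rho for i = 1..t, cf. (4).
    rho = 0 gives the plain average; larger rho emphasizes recent samples."""
    raw = [(i / t) ** rho for i in range(1, t + 1)]
    total = sum(raw)
    return [w / total for w in raw]
```

With `eta = 1` and disjoint groups this reduces to the familiar group-Lasso penalty; with `rho = 0` every sample gets weight `1/t`, matching the classical average mentioned above.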

As an example, let $\mathcal{D}_i = S_2^{d_x}$ ($\forall i$),
$\mathcal{A} = \mathbb{R}^{d_\alpha}$. In this case, columns of
$\mathbf{D}$ are restricted to the Euclidean unit sphere and we have no
constraints for $\boldsymbol{\alpha}$. Now, let
$|\mathcal{G}| = d_\alpha$ and
$\mathcal{G} = \{desc_1, \ldots, desc_{d_\alpha}\}$, where $desc_i$
represents the $i$th node and its children in a fixed tree. Then the
coordinates $\alpha_i$ are searched in a hierarchical tree structure
and the hierarchical dictionary $\mathbf{D}$ is optimized accordingly.

B. Optimization 

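Optimization of cost function (4) is equivalent to the joint
optimization of dictionary $\mathbf{D}$ and representation
$\{\boldsymbol{\alpha}_i\}_{i=1}^t$:

$$\mathrm{argmin}_{\mathbf{D} \in \mathcal{D}, \{\boldsymbol{\alpha}_i \in \mathcal{A}\}_{i=1}^t} f_t(\mathbf{D}, \{\boldsymbol{\alpha}_i\}_{i=1}^t), \tag{5}$$

where

$$f_t = \frac{1}{\sum_{j=1}^t (j/t)^\rho} \sum_{i=1}^t \left( \frac{i}{t} \right)^\rho \left[ \frac{1}{2} \|\mathbf{x}_{O_i} - \mathbf{D}_{O_i} \boldsymbol{\alpha}_i\|_2^2 + \kappa \Omega(\boldsymbol{\alpha}_i) \right]. \tag{6}$$

$\mathbf{D}$ is optimized by using the sequential observations
$\mathbf{x}_{O_i}$ online in an alternating manner:

1) The actual dictionary estimation $\mathbf{D}_{t-1}$ and sample
$\mathbf{x}_{O_t}$ are used to optimize (2) for representation
$\boldsymbol{\alpha}_t$.

2) For the estimated representations $\{\boldsymbol{\alpha}_i\}_{i=1}^t$,
the dictionary estimation $\mathbf{D}_t$ is derived from the quadratic
optimization problem

$$f_t(\mathbf{D}_t) = \min_{\mathbf{D} \in \mathcal{D}} f_t(\mathbf{D}, \{\boldsymbol{\alpha}_i\}_{i=1}^t). \tag{7}$$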

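1) Representation optimization ($\boldsymbol{\alpha}$): Note that (2) is
a non-convex optimization problem with respect to $\boldsymbol{\alpha}$.
The variational properties of norm $\eta$ can be used to overcome this
problem. One can show, similarly to [30], that by introducing an
auxiliary variable $\mathbf{z} \in \mathbb{R}_+^{|\mathcal{G}|}$, the
solution $\boldsymbol{\alpha}$ of the optimization task (9) is equal to
the solution of (2):

$$\mathrm{argmin}_{\boldsymbol{\alpha} \in \mathcal{A}, \mathbf{z} \in \mathbb{R}_+^{|\mathcal{G}|}} J(\boldsymbol{\alpha}, \mathbf{z}), \text{ where} \tag{8}$$

$$J(\boldsymbol{\alpha}, \mathbf{z}) = \frac{1}{2} \|\mathbf{x}_{O_t} - (\mathbf{D}_{t-1})_{O_t} \boldsymbol{\alpha}\|_2^2 + \kappa \frac{1}{2} \left( \boldsymbol{\alpha}^T \mathrm{diag}(\boldsymbol{\zeta}) \boldsymbol{\alpha} + \|\mathbf{z}\|_{\frac{\eta}{2 - \eta}} \right), \tag{9}$$

$\boldsymbol{\zeta} = \boldsymbol{\zeta}(\mathbf{z}) \in \mathbb{R}^{d_\alpha}$
and $\zeta_j = \sum_{G \in \mathcal{G}, G \ni j} (d_j^G)^2 / z^G$. The
optimization of (9) can be carried out by iterative alternating
steps. One can minimize the quadratic cost function on the
convex set $\mathcal{A}$ for a given $\mathbf{z}$ with standard solvers
[37]. Then, one can use the variational principle and find solution
$\mathbf{z} = (z^G)_{G \in \mathcal{G}}$ for a fixed
$\boldsymbol{\alpha}$ by means of the explicit expression

$$z^G = \|\mathbf{d}^G \circ \boldsymbol{\alpha}\|_2^{2 - \eta} \left\| \left( \|\mathbf{d}^G \circ \boldsymbol{\alpha}\|_2 \right)_{G \in \mathcal{G}} \right\|_\eta^{\eta - 1}. \tag{10}$$

Note that for numerical stability, smoothing
$\mathbf{z} = \max(\mathbf{z}, \varepsilon)$ ($0 < \varepsilon \ll 1$)
is suggested in practice.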
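The alternating scheme for (9)-(10) can be sketched in plain Python as follows. This is a minimal illustration under explicit assumptions rather than the authors' Matlab implementation: it takes $\mathcal{A} = \mathbb{R}^{d_\alpha}$ (so the quadratic step becomes the linear system $(\mathbf{D}_O^T \mathbf{D}_O + \kappa\,\mathrm{diag}(\boldsymbol{\zeta}))\boldsymbol{\alpha} = \mathbf{D}_O^T \mathbf{x}_O$), indicator weights $\mathbf{d}^G$, and groups that together cover every coordinate. The names `solve` and `sparse_code` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def sparse_code(x_O, D_O, groups, kappa, eta, n_iter=5, eps=1e-5):
    """Alternate between the quadratic alpha-step and the closed-form z-step (10).
    Assumes A = R^{d_alpha}, indicator weights, and groups covering all indices."""
    d_a = len(D_O[0])
    gram = [[sum(row[a] * row[b] for row in D_O) for b in range(d_a)]
            for a in range(d_a)]
    dtx = [sum(row[a] * xi for row, xi in zip(D_O, x_O)) for a in range(d_a)]
    z = [1.0] * len(groups)
    alpha = [0.0] * d_a
    for _ in range(n_iter):
        # zeta_j = sum over groups containing j of 1 / z^G (indicator weights).
        zeta = [sum(1.0 / z[g] for g, G in enumerate(groups) if j in G)
                for j in range(d_a)]
        A = [[gram[a][b] + (kappa * zeta[a] if a == b else 0.0)
              for b in range(d_a)] for a in range(d_a)]
        alpha = solve(A, dtx)
        norms = [sum(alpha[j] ** 2 for j in G) ** 0.5 for G in groups]
        total = sum(n ** eta for n in norms) ** (1.0 / eta)
        if total == 0.0:
            z = [eps] * len(groups)
        else:
            # Closed form (10) followed by the smoothing z = max(z, eps).
            z = [max(n ** (2.0 - eta) * total ** (eta - 1.0), eps) for n in norms]
    return alpha
```

Groups whose norm shrinks receive a small $z^G$ and hence a large quadratic penalty in the next $\boldsymbol{\alpha}$-step, which is how the iteration drives whole groups towards zero.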

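2) Dictionary optimization ($\mathbf{D}$): The block-coordinate descent
(BCD) method [37] is used for the optimization of $\mathbf{D}$:
columns $\mathbf{d}_j$ in $\mathbf{D}$ are optimized one-by-one by
keeping the other columns ($\mathbf{d}_i$, $i \ne j$) fixed. For a given
$j$, $f_t$ is quadratic in $\mathbf{d}_j$. The minimum is found by
solving $\frac{\partial f_t}{\partial \mathbf{d}_j}(\mathbf{u}_j) = \mathbf{0}$,
and then this solution is projected to the constraint set
$\mathcal{D}_j$ ($\mathbf{d}_j \leftarrow \Pi_{\mathcal{D}_j}(\mathbf{u}_j)$).
One can show by executing the differentiation that $\mathbf{u}_j$
satisfies the linear equation system

$$\mathbf{C}_{j,t} \mathbf{u}_j = \mathbf{b}_{j,t} - \mathbf{e}_{j,t} + \mathbf{C}_{j,t} \mathbf{d}_j, \tag{11}$$

where

$$\mathbf{C}_{j,t} = \sum_{i=1}^t \left( \frac{i}{t} \right)^\rho \boldsymbol{\Delta}_i \alpha_{i,j}^2 \in \mathbb{R}^{d_x \times d_x}, \tag{12}$$

$$\mathbf{e}_{j,t} = \sum_{i=1}^t \left( \frac{i}{t} \right)^\rho \boldsymbol{\Delta}_i \mathbf{D} \boldsymbol{\alpha}_i \alpha_{i,j} \in \mathbb{R}^{d_x}, \tag{13}$$

$$\mathbf{B}_t = \sum_{i=1}^t \left( \frac{i}{t} \right)^\rho \boldsymbol{\Delta}_i \mathbf{x}_i \boldsymbol{\alpha}_i^T = [\mathbf{b}_{1,t}, \ldots, \mathbf{b}_{d_\alpha,t}], \tag{14}$$

matrices $\mathbf{C}_{j,t}$ are diagonal,
$\mathbf{B}_t \in \mathbb{R}^{d_x \times d_\alpha}$, and
$\boldsymbol{\Delta}_i \in \mathbb{R}^{d_x \times d_x}$ is the diagonal
matrix representation of the $O_i$ set (for $j \in O_i$ the $j$th
diagonal is 1 and is 0 otherwise). It is sufficient to update statistics
$\{\{\mathbf{C}_{j,t}\}_{j=1}^{d_\alpha}, \mathbf{B}_t, \{\mathbf{e}_{j,t}\}_{j=1}^{d_\alpha}\}$
online for the optimization of $f_t$, which can be done exactly for
$\mathbf{C}_{j,t}$ and $\mathbf{B}_t$:

$$\mathbf{C}_{j,t} = \gamma_t \mathbf{C}_{j,t-1} + \boldsymbol{\Delta}_t \alpha_{t,j}^2, \tag{15}$$

$$\mathbf{B}_t = \gamma_t \mathbf{B}_{t-1} + \boldsymbol{\Delta}_t \mathbf{x}_t \boldsymbol{\alpha}_t^T, \tag{16}$$

where $\gamma_t = (1 - \frac{1}{t})^\rho$ and the recursions are
initialized by (i) $\mathbf{C}_{j,0} = \mathbf{0}$,
$\mathbf{B}_0 = \mathbf{0}$ for $\rho = 0$ and (ii) in an arbitrary way
for $\rho > 0$. According to numerical experiences,

$$\mathbf{e}_{j,t} = \gamma_t \mathbf{e}_{j,t-1} + \boldsymbol{\Delta}_t \mathbf{D}_t \boldsymbol{\alpha}_t \alpha_{t,j}, \tag{17}$$

is a good approximation for $\mathbf{e}_{j,t}$ with the actual
estimation $\mathbf{D}_t$ and with initialization
$\mathbf{e}_{j,0} = \mathbf{0}$. It may be worth noting that the
convergence speed is often improved if statistics are updated in
mini-batches $\{\mathbf{x}_{O_{t_1}}, \ldots, \mathbf{x}_{O_{t_R}}\}$.¹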
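The online statistics (15)-(17) and one BCD column update from (11) can be sketched as below. This is a hedged illustration, not the released OSDL code: the names `update_stats` and `bcd_column` are hypothetical, the diagonal matrices $\mathbf{C}_{j,t}$ are stored as plain vectors, the constraint set is assumed to be the Euclidean unit ball, and a coordinate whose diagonal entry of $\mathbf{C}_{j,t}$ is zero is simply left unchanged (an assumption, since (11) does not determine it).

```python
def update_stats(C, B, E, D, t, rho, obs, x, alpha):
    """One online step of (15)-(17) for sample t (t = 1, 2, ...).
    C[j][k]: k-th diagonal entry of C_{j,t};  B[k][j]: entry of B_t;
    E[j][k]: k-th entry of e_{j,t};  obs: set of observed row indices O_t;
    x: full-length vector, only its observed entries are used."""
    g = (1.0 - 1.0 / t) ** rho          # gamma_t
    d_x, d_a = len(D), len(D[0])
    D_alpha = [sum(D[k][m] * alpha[m] for m in range(d_a)) for k in range(d_x)]
    for j in range(d_a):
        for k in range(d_x):
            delta = 1.0 if k in obs else 0.0   # diagonal of Delta_t
            C[j][k] = g * C[j][k] + delta * alpha[j] ** 2
            B[k][j] = g * B[k][j] + delta * x[k] * alpha[j]
            E[j][k] = g * E[j][k] + delta * D_alpha[k] * alpha[j]


def bcd_column(d_j, C_j, b_j, e_j):
    """Solve the diagonal system (11) for u_j, then project onto the unit ball."""
    u = [d + (b - e) / c if c > 0.0 else d
         for d, c, b, e in zip(d_j, C_j, b_j, e_j)]
    n = sum(v * v for v in u) ** 0.5
    return u if n <= 1.0 else [v / n for v in u]
```

Because $\gamma_t \gamma_{t-1} \cdots \gamma_{i+1} = (i/t)^\rho$, running `update_stats` over the samples reproduces the weighted sums (12) and (14) without storing past observations.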

III. OSDL Based Collaborative Filtering 

We formulate the CF task as an OSDL optimization problem 
in Section III-A. According to the CF literature, oftentimes 
neighbor-based corrections improve the precision of the esti- 
mation. We also use this technique (Section III-B) to improve 
the OSDL estimations. 

A. CF Cast as an OSDL Problem

Below, we transform the CF task into an OSDL problem.
Consider the $t$th user's known ratings as OSDL observations
$\mathbf{x}_{O_t}$. Let the optimized group-structured dictionary on
these observations be $\mathbf{D}$. Now, assume that we have a test user
and his/her ratings, i.e., $\mathbf{x}_O \in \mathbb{R}^{|O|}$. The task
is to estimate $\mathbf{x}_{\{1, \ldots, d_x\} \backslash O}$, that is,
the missing coordinates of $\mathbf{x}$ (the missing ratings of the
user), which can be accomplished as follows:

1) Remove the rows of the non-observed $\{1, \ldots, d_x\} \backslash O$
coordinates from $\mathbf{D}$. The obtained $|O| \times d_\alpha$ sized
matrix $\mathbf{D}_O$ and $\mathbf{x}_O$ can be used to estimate
$\boldsymbol{\alpha}$ by solving the structured sparse coding
problem (2).

2) Using the estimated representation $\boldsymbol{\alpha}$, estimate
$\mathbf{x}$ as

$$\hat{\mathbf{x}} = \mathbf{D} \boldsymbol{\alpha}. \tag{18}$$
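The two prediction steps can be sketched as follows. Note the swapped-in technique: to keep the example short, the coding step uses a plain ridge-regularized least squares solve as a stand-in for the structured sparse coding problem (2); any solver of (2) can be passed in through the `coder` argument instead. The names `solve`, `ridge_coder`, and `predict_ratings`, and the parameter `lam`, are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge_coder(x_O, D_O, lam=0.1):
    """Stand-in for (2): solves (D_O^T D_O + lam I) alpha = D_O^T x_O."""
    d_a = len(D_O[0])
    gram = [[sum(row[a] * row[b] for row in D_O) + (lam if a == b else 0.0)
             for b in range(d_a)] for a in range(d_a)]
    rhs = [sum(row[a] * xi for row, xi in zip(D_O, x_O)) for a in range(d_a)]
    return solve(gram, rhs)


def predict_ratings(D, observed, coder=ridge_coder):
    """observed: dict {item index: known rating} of one user.
    Step 1: keep only the observed rows of D and estimate alpha.
    Step 2: return the full estimate D alpha, cf. (18)."""
    O = sorted(observed)
    D_O = [D[i] for i in O]
    x_O = [observed[i] for i in O]
    alpha = coder(x_O, D_O)
    return [sum(d * a for d, a in zip(row, alpha)) for row in D]
```

The returned vector has one entry per item; the entries at unobserved indices are the predicted ratings.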

B. Neighbor Based Correction 

According to the CF literature, neighbor based correction 
schemes may further improve the precision of the estimations 
[1]. This neighbor correction approach 

• relies on the assumption that similar items (e.g., 
jokes/movies) are rated similarly and 

• can be adapted to OSDL-based CF estimation in a natural 
fashion. 

Here, we detail the idea. Let us assume that the similarities
$s_{ij} \in \mathbb{R}$ ($i, j \in \{1, \ldots, d_x\}$) between
individual items are given. We shall provide similarity forms in
Section IV-B. Let $\mathbf{d}_k \boldsymbol{\alpha}_t \in \mathbb{R}$ be
the OSDL estimation for the rating of the $k$th non-observed item of the
$t$th user ($k \notin O_t$), where
$\mathbf{d}_k \in \mathbb{R}^{1 \times d_\alpha}$ is the $k$th row of
matrix $\mathbf{D} \in \mathbb{R}^{d_x \times d_\alpha}$, and
$\boldsymbol{\alpha}_t \in \mathbb{R}^{d_\alpha}$ is computed according
to Section III-A.

Let the prediction error on the observable item neighbors ($j$)
of the $k$th item of the $t$th user ($j \in O_t \backslash \{k\}$) be
$\mathbf{d}_j \boldsymbol{\alpha}_t - x_{jt} \in \mathbb{R}$. These
prediction errors can be used for the correction of the OSDL estimation
($\mathbf{d}_k \boldsymbol{\alpha}_t$) by taking into account the
$s_{ij}$ similarities:

$$\hat{x}_{kt} = \mathbf{d}_k \boldsymbol{\alpha}_t + \gamma_1 \left[ \frac{\sum_{j \in O_t \backslash \{k\}} s_{kj} (\mathbf{d}_j \boldsymbol{\alpha}_t - x_{jt})}{\sum_{j \in O_t \backslash \{k\}} s_{kj}} \right], \text{ or} \tag{19}$$

$$\hat{x}_{kt} = \gamma_0 (\mathbf{d}_k \boldsymbol{\alpha}_t) + \gamma_1 \left[ \frac{\sum_{j \in O_t \backslash \{k\}} s_{kj} (\mathbf{d}_j \boldsymbol{\alpha}_t - x_{jt})}{\sum_{j \in O_t \backslash \{k\}} s_{kj}} \right], \tag{20}$$

where $k \notin O_t$. Here, (19) is analogous to the form of [2],
(20) is a simple modification: it modulates the first term with
a separate $\gamma_0$ weight.

¹ The Matlab code of the OSDL method is available at
http://nipg.inf.elte.hu/szzoli.
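The correction formulas (19) and (20) translate almost literally into code. The sketch below is illustrative only: the function name `corrected_rating` and its argument layout are hypothetical, and returning the uncorrected first term when the similarity weights sum to zero is an added assumption that the text does not discuss.

```python
def corrected_rating(k, D, alpha, observed, s, gamma1, gamma0=1.0):
    """Neighbor-corrected estimate for the non-observed item k of one user.
    D: dictionary as a list of rows d_i;  alpha: the user's representation;
    observed: dict {item j: rating x_jt} for j in O_t;
    s: similarity table with s[k][j] >= 0.
    gamma0 = 1.0 gives (19); any other gamma0 gives (20)."""
    def estimate(i):
        return sum(d * a for d, a in zip(D[i], alpha))

    neighbors = [j for j in observed if j != k]
    num = sum(s[k][j] * (estimate(j) - observed[j]) for j in neighbors)
    den = sum(s[k][j] for j in neighbors)
    correction = num / den if den > 0.0 else 0.0  # assumption: skip if no weight
    return gamma0 * estimate(k) + gamma1 * correction
```

Since the bracketed term is the similarity-weighted prediction error $\mathbf{d}_j \boldsymbol{\alpha}_t - x_{jt}$, a validated $\gamma_1$ that reduces the error on similar items would be expected to come out negative; the sign is absorbed into the tuned weight.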



IV. Numerical Results 

We have chosen the Jester dataset (Section IV-A) for the 
illustration of the OSDL based CF approach. It is a standard 
benchmark for CF. We detail our preferred item similarities in 
Section IV-B. To evaluate the CF based estimation, we use the 
performance measures given in Section IV-C. Section IV-D
presents our numerical experiments.



A. The Jester Dataset 

The dataset [38] contains 4,136,360 ratings from 73,421
users to 100 jokes on a continuous [-10, 10] range. The worst
and best possible gradings are -10 and +10, respectively. A
fixed 10 element subset of the jokes is called gauge set and it
was evaluated by all users. Two thirds of the users have rated
at least 36 jokes, and the remaining ones have rated between
15 and 35 jokes. The average number of user ratings per joke
is 46.



B. Item Similarities 

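In the neighbor correction step (19) or (20) we need the $s_{ij}$
values representing the similarities of the $i$th and $j$th items.
We define this value as the similarity of the $i$th and $j$th rows
($\mathbf{d}_i$ and $\mathbf{d}_j$) of the optimized OSDL dictionary
$\mathbf{D}$ [2]:

$S_1$: $s_{ij} = s_{ij}(\mathbf{d}_i, \mathbf{d}_j) = \ldots$, or (21)

$S_2$: $s_{ij} = s_{ij}(\mathbf{d}_i, \mathbf{d}_j) = \ldots$, (22)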

where $\beta > 0$ is the parameter of the similarity measure.
Quantities $s_{ij}$ are non-negative; if the value of $s_{ij}$ is close
to zero (large) then the $i$th and $j$th items are very different
(very similar).



C. Performance Measure 

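In our numerical experiments we used the RMSE (root mean
square error) and the MAE (mean absolute error) measure for
the evaluation of the quality of the estimation, since these are
the most popular measures in the CF literature. The RMSE
and MAE measures are the root of the average squared difference
and the average absolute difference of the true and the estimated
rating values, respectively:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{S}|} \sum_{(i,t) \in \mathcal{S}} (x_{it} - \hat{x}_{it})^2}, \tag{23}$$

$$\mathrm{MAE} = \frac{1}{|\mathcal{S}|} \sum_{(i,t) \in \mathcal{S}} |x_{it} - \hat{x}_{it}|, \tag{24}$$

where $\mathcal{S}$ denotes either the validation or the test set.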
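Both measures are one-liners in practice. In the following sketch the set $\mathcal{S}$ is represented, as an assumption made for illustration, by a list of `(true_rating, estimated_rating)` pairs; the function names are hypothetical.

```python
import math


def rmse(pairs):
    """Root mean square error (23) over (true, estimated) rating pairs."""
    return math.sqrt(sum((x - x_hat) ** 2 for x, x_hat in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error (24) over (true, estimated) rating pairs."""
    return sum(abs(x - x_hat) for x, x_hat in pairs) / len(pairs)
```

RMSE penalizes large individual errors more strongly than MAE, so the RMSE of a set of predictions is never smaller than its MAE.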

D. Evaluation

Here we illustrate the efficiency of the OSDL-based CF es- 
timation on the Jester dataset (Section IV-A) using the RMSE 
and MAE performance measures (Section IV-C). We start our 
discussion with the RMSE results. The MAE performance 
measure led to similar results; for the sake of completeness we 
report these results at the end of this section. To the best of our
knowledge, the top results on this database are RMSE = 4.1123
[39] and RMSE = 4.1229 [2]. Both works are from the same
authors. The method in the first paper is called item neighbor
and it makes use of only neighbor information. In [2], the
authors used a bridge regression based unstructured dictionary
learning model with a neighbor correction scheme; they
optimized the dictionary by gradient descent and set $d_\alpha$ to
100. These are our performance baselines.

To study the capability of the OSDL approach in CF, we 
focused on the following issues: 

• Is a structured dictionary $\mathbf{D}$ beneficial for prediction
purposes, and how does it compare to the dictionary of classical
(unstructured) sparse dictionary learning?

• How do the OSDL parameters and the applied similarity/neighbor
correction affect the efficiency of the prediction?

• How do different group structures $\mathcal{G}$ fit the CF task?

In our numerical studies we chose the Euclidean unit sphere
for $\mathcal{D}_i = S_2^{d_x}$ ($\forall i$), and
$\mathcal{A} = \mathbb{R}^{d_\alpha}$, and no additional weighting
was applied ($\mathbf{d}^G = \chi_G$, $\forall G \in \mathcal{G}$, where
$\chi$ is the indicator function). We set $\eta$ of the group-structured
regularizer $\Omega$ to 0.5. Group structure $\mathcal{G}$ of vector
$\boldsymbol{\alpha}$ was realized on

• a $d \times d$ toroid ($d_\alpha = d^2$) with
$|\mathcal{G}| = d_\alpha$ applying $r \ge 0$ neighbors to define
$\mathcal{G}$. For $r = 0$
($\mathcal{G} = \{\{1\}, \ldots, \{d_\alpha\}\}$) the classical sparse
representation based dictionary is recovered.

• a hierarchy with a complete binary tree structure. In this
case:

- $|\mathcal{G}| = d_\alpha$, and group $G$ of $\alpha_i$ contains the
$i$th node and its descendants on the tree, and

- the size of the tree is determined by the number of
levels $l$. The dimension of the hidden representation
is then $d_\alpha = 2^l - 1$.
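The two group structures can be generated programmatically. The sketch below is an illustration under assumptions that the text leaves open: on the toroid, "applying $r$ neighbors" is interpreted as the square neighborhood of radius $r$ with wrap-around, and the tree uses heap-style indexing (children of node $i$ are $2i + 1$ and $2i + 2$). Indices are 0-based and the function names are hypothetical.

```python
def toroid_groups(d, r):
    """One group per node of a d x d torus: the node and all nodes within
    radius r (square neighborhood with wrap-around; an assumed reading)."""
    groups = []
    for i in range(d):
        for j in range(d):
            G = {((i + di) % d) * d + (j + dj) % d
                 for di in range(-r, r + 1) for dj in range(-r, r + 1)}
            groups.append(sorted(G))
    return groups


def binary_tree_groups(levels):
    """One group per node of a complete binary tree with the given number of
    levels: the node itself and all of its descendants."""
    n = 2 ** levels - 1
    groups = []
    for i in range(n):
        G, stack = [], [i]
        while stack:
            node = stack.pop()
            if node < n:
                G.append(node)
                stack.extend((2 * node + 1, 2 * node + 2))
        groups.append(sorted(G))
    return groups
```

With `r = 0` the toroid groups are singletons, which recovers the unstructured case, and `binary_tree_groups(4)` yields the 15 groups used for $l = 4$.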

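The size $R$ of mini-batches was set either to 8, or
to 16 and the forgetting factor $\rho$ was chosen from
set $\{0, \frac{1}{64}, \frac{1}{32}, \frac{1}{16}, \frac{1}{8}, \frac{1}{4}, \frac{1}{2}, 1\}$.
The $\kappa$ weight of structure inducing regularizer $\Omega$ was
chosen from the set
$\{\frac{1}{2^{-1}}, \frac{1}{2^0}, \frac{1}{2^1}, \frac{1}{2^2}, \frac{1}{2^4}, \frac{1}{2^6}, \ldots, \frac{1}{2^{14}}\}$.
We studied similarities $S_1$,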
$S_2$ [see (21)-(22)] with both neighbor correction schemes
[(19)-(20)]. In what follows, corrections based on (19) and (20)
will be called $S_1$, $S_2$ and $S_1^0$, $S_2^0$, respectively.
Similarity parameter $\beta$ was chosen from the set
$\{0.2, 1, 1.8, 2.6, \ldots, 14.6\}$. In the BCD step of the
optimization of $\mathbf{D}$, 5 iterations were applied. In the
$\boldsymbol{\alpha}$ optimization step, we used 5 iterations,
whereas smoothing parameter $\varepsilon$ was $10^{-5}$.

We used a 90%-10% random split for the observable
ratings in our experiments, similarly to [2]:

• training set (90%) was further divided into 2 parts:

- we chose the 80% observation set $\{O_t\}$ randomly,
and optimized $\mathbf{D}$ according to the corresponding
$\mathbf{x}_{O_t}$ observations,

- we used the remaining 10% for validation, that is
for choosing the optimal OSDL parameters ($r$ or
$l$, $\kappa$, $\rho$), BCD optimization parameter ($R$), neighbor
correction ($S_1$, $S_2$, $S_1^0$, $S_2^0$), similarity parameter
($\beta$), and correction weights ($\gamma_i$s in (19) or (20)).

• we used the remaining 10% of the data for testing.

The optimal parameters were estimated on the validation set,
and then used on the test set. The resulting RMSE/MAE score
was the performance of the estimation.
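The 80%-10%-10% protocol can be sketched with the standard library as follows. This is only an illustration: splitting at the level of individual `(user, item, rating)` triples, the fixed `seed`, and the function name `split_ratings` are assumptions, since the text does not specify how the random split was implemented.

```python
import random


def split_ratings(ratings, seed=0):
    """Randomly split a list of (user, item, rating) triples into
    80% dictionary-learning, 10% validation, and 10% test parts."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = (8 * n) // 10
    n_val = n // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

Parameters would then be chosen by the lowest validation error, and only the selected configuration is evaluated once on the test part.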

1) Toroid Group Structure: In this section we provide
results using toroid group structure. We set $d = 10$. The size
of the toroid was $10 \times 10$, and thus the dimension of the
representation was $d_\alpha = 100$.

In the first experiment we study how the size of neighborhood
($r$) affects the results. This parameter corresponds to the
"smoothness" imposed on the group structure: when $r = 0$,
then there is no relation between the
$\mathbf{d}_j \in \mathbb{R}^{d_x}$ columns in $\mathbf{D}$
(no structure). As we increase $r$, the $\mathbf{d}_j$ feature vectors
will be more and more aligned in a smooth way. To this end, we set the
neighborhood size to $r = 0$ (no structure), and then increased
it to 1, 2, 3, 4, and 5. For each $(\kappa, \rho, \beta)$, we
calculated the RMSE of our estimation, and then for each fixed
$(\kappa, \rho)$ pair, we minimized these RMSE values in $\beta$. The
resulting validation and test surfaces are shown in Fig. 1. For the
best $(\kappa, \rho)$ pair, we also present the RMSE values as a
function of $\beta$ (Fig. 2). In this illustration we used $S_1^0$
neighbor correction and $R = 8$ mini-batch size. We note that we got
similar results using $R = 16$ too. Our results can be summarized as
follows.

• For a fixed neighborhood parameter $r$, we have that:

- The validation and test surfaces are very similar
(see Fig. 1(e)-(f)). It implies that the validation
surfaces are good indicators for the test errors. For
the best $r$, $\kappa$ and $\rho$ parameters, we can observe
that the validation and test curves (as functions of
$\beta$) are very similar. This is demonstrated in Fig. 2,
where we used $r = 4$ neighborhood size and $S_1^0$
neighbor correction. We can also notice that (i) both
curves have only one local minimum, and (ii) these
minimum points are close to each other.

- The quality of the estimation depends mostly on the
$\kappa$ regularization parameter. As we increase $r$, the best
$\kappa$ value is decreasing.

- The estimation is robust to the different choices of
forgetting factors (see Fig. 1(a)-(e)). In other words,
this parameter $\rho$ can help in fine-tuning the results.

• Structured dictionaries ($r > 0$) are advantageous over
those methods that do not impose structure on the dictionary
elements ($r = 0$). For $S_1^0$ and $S_2^0$ neighbor
corrections, we summarize the RMSE results in Table I.
Based on this table we can conclude that in the studied
parameter domain

- the estimation is robust to the selection of the mini-batch
size ($R$). We got the best results using $R = 8$.
Similarly to the role of parameter $\rho$, adjusting $R$ can
be used for fine-tuning.

- the $S_1^0$ neighbor correction led to the smallest
RMSE value.

- When we increase $r$ up to $r = 4$, the results improve.
However, for $r = 5$, the RMSE values do not
improve anymore; they are about the same as those we
have using $r = 4$.

- The smallest RMSE we could achieve was 4.0774,
and the best known result so far was RMSE = 4.1123
[39]. This demonstrates the efficiency of our OSDL based
collaborative filtering algorithm.

- We note that our RMSE result seems to be significantly
better than that of the competitors: we
repeated this experiment 5 more times with different
randomly selected training, test, and validation sets,
and our RMSE results have never been worse than
4.08.



Fig. 2: RMSE validation and test curves (RMSE as a function of the
similarity parameter $\beta$) for toroid group structure using the
optimal neighborhood size $r = 4$, regularization weight
$\kappa = \frac{1}{2^{10}}$, forgetting factor $\rho$, mini-batch
size $R = 8$, and similarity parameter $\beta = 3.4$. The applied
neighbor correction was $S_1^0$.

In the second experiment we studied how the different
neighbor corrections ($S_1$, $S_2$, $S_1^0$, $S_2^0$) affect the
performance of the proposed algorithm. To this end, we set the
neighborhood parameter to $r = 4$ because it proved to be optimal in the
previous experiment. Our results are summarized in Table II.
From these results we can observe that

• our method is robust to the selection of correction methods.
Similarly to the $\rho$ and $R$ parameters, the neighbor
correction scheme can help in fine-tuning the results.

• The introduction of $\gamma_0$ in (20) with the application of
$S_1^0$ and $S_2^0$ instead of $S_1$ and $S_2$ proved to be
advantageous in the neighbor correction phase.

• For the studied CF problem, the $S_1^0$ neighbor correction
method (with $R = 8$) led to the smallest RMSE value, 4.0774.

• The $R \in \{8, 16\}$ setting yielded similarly good results.
Even with $R = 16$, the RMSE value was 4.0777.

Fig. 1: RMSE validation surfaces [(a)-(e)] and test surfaces (f) as a function of forgetting factor ($\rho$) and regularization ($\kappa$).
For a fixed $(\kappa, \rho)$ parameter pair, the surfaces show the best RMSE values optimized in the $\beta$ similarity parameter. The group
structure ($\mathcal{G}$) is toroid. The applied neighbor correction was $S_1^0$. (a): $r = 0$ (no structure). (b): $r = 1$. (c): $r = 2$. (d): $r = 3$.
(e)-(f): $r = 4$, on the same scale.

TABLE I: Performance (RMSE) of the OSDL prediction using
toroid group structure ($\mathcal{G}$) with different neighbor sizes $r$
($r = 0$: unstructured case). First-second row: mini-batch size
$R = 8$, third-fourth row: $R = 16$. Odd rows: $S_1^0$, even rows:
$S_2^0$ neighbor correction. For fixed $R$, the best performance is
marked with an asterisk.

                    r = 0     r = 1     r = 2     r = 3     r = 4
R = 8    S_1^0     4.1594    4.1326    4.1274    4.0792    4.0774*
         S_2^0     4.1765    4.1496    4.1374    4.0815    4.0802
R = 16   S_1^0     4.1611    4.1321    4.1255    4.0804    4.0777*
         S_2^0     4.1797    4.1487    4.1367    4.0826    4.0802

TABLE II: Performance (RMSE) of the OSDL prediction for
different neighbor corrections using toroid group structure
($\mathcal{G}$). Columns: applied neighbor corrections. Rows:
mini-batch size $R = 8$ and 16. The neighbor size was set to $r = 4$.
For fixed $R$, the best performance is marked with an asterisk.

            S_1       S_2       S_1^0     S_2^0
R = 8      4.0805    4.0844    4.0774*   4.0802
R = 16     4.0809    4.0843    4.0777*   4.0802

2) Hierarchical Group Structure: In this section we provide
results using hierarchical $\boldsymbol{\alpha}$ representation. The
group structure $\mathcal{G}$ was chosen to represent a complete binary
tree.

In our third experiment we study how the number of levels
($l$) of the tree affects the results. To this end, we set the number
of levels to $l = 3$, 4, 5, and 6. Since $d_\alpha$, the dimension of
the hidden representation $\boldsymbol{\alpha}$, equals $2^l - 1$,
these $l$ values give rise to dimensions $d_\alpha = 7$, 15, 31, and 63.
Validation and test surfaces are provided in Fig. 3(a)-(c) and
(e)-(f), respectively. The surfaces show for each $(\kappa, \rho)$
pair, the minimum RMSE values taken in the similarity parameter
$\beta$. For the best $(\kappa, \rho)$ parameter pair, the dependence
of RMSE on $\beta$ is presented in Fig. 3(d). In this illustration we
used $S_1^0$ neighbor correction, and the mini-batch size was set to
$R = 8$. Our results are summarized below. We note that we obtained
similar results with mini-batch size $R = 16$.

• For fixed number of levels I, similarly to the toroid group 
structure (where the size r of the neighborhood was 
fixed), 

- validation and test surfaces are very similar, see 
Fig. 3(b)-(c). Validation and test curves as a function 
of /3 behave alike, see Fig. 3(d). 
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TABLE III: Performance (RMSE) of the OSDL prediction for 
different number of levels (I) using binary tree structure (9). 
First-second row: mini-batch size R = 8, third-fourth row: 
R = 16. Odd rows: S°, even rows: S° neighbor correction. 
For fixed R, the best performance is highlighted with boldface 
typesetting. 







i = 3 


i = 4 


i = 5 


i = 6 


R = 8 


sy 


4.1572 


4.1220 


4.1241 


4.1374 






4.1669 


4.1285 


4.1298 


4.1362 


R = 16 


sy 


4.1578 


4.1261 


4.1249 


4.1373 




s 2 ° 


4.1638 


4.1332 


4.1303 


4.1383 



TABLE IV: Performance (RMSE) of the OSDL prediction for 
different neighbor corrections using binary tree structure (S). 
Rows: mini-batch size R = 8 and 16. Columns: neighbor 
corrections. Neighbor size: r — 4. For fixed R, the best 
performance is highlighted with boldface typesetting. 





Si 


S 2 


sy 




R = 8 


4.1255 


4.1338 


4.1220 


4.1285 


R = 16 


4.1296 


4.1378 


4.1261 


4.1332 



- the precision of the estimation depends mostly on 
the regularization parameter k; forgetting factor p 
enables fine-tuning. 

• The obtained RMSE values are summarized in Table III 
for Si and Sj neighbor corrections. According to the 
table, the quality of estimation is about the same for 
mini-batch size R = 8 and R = 16; the R = 8 based 
estimation seems somewhat more precise. Considering 
the neighbor correction schemes S° and S$, provided 
better predictions. 

• As a function of the number of levels, we got the best 
result for I = 4, RMSE = 4.1220; RMSE values decrease 
until I — 4 and then increase for I > 4. 

• Our best obtained RMSE value is 4.1220; it was achieved 
for dimension only d a = 15. We note that this small 
dimensional, hierarchical group structure based result is 
also better than that of [2] with RMSE = 4.1229, which 
makes use of unstructured dictionaries with d a = 100. 
The result is also competitive with the RMSE = 4.1123 
value of [39]. 

In our fourth experiment we investigate how the different
neighbor corrections ($S_1$, $S_2$, $S_1^0$, $S_2^0$) affect the precision of
the estimations. We fixed the number of levels to $l = 4$, since
it proved to be the optimal choice in our previous experiment.
Our results are summarized in Table IV. We found that

• the estimation is robust to the choice of neighbor corrections,

• it is worth including weight $\gamma_0$ [see (20)] to improve the
precision of prediction, that is, to apply corrections $S_1^0$ and
$S_2^0$ instead of $S_1$ and $S_2$, respectively,

• the studied $R \in \{8, 16\}$ mini-batch sizes provided similarly
good results,

• for the studied CF problem the best RMSE value was
achieved using $S_1^0$ neighbor correction and mini-batch
size $R = 8$.
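The observations above can be checked directly against the values of Table IV. The short sketch below transcribes the table, selects the best neighbor correction for each mini-batch size, and computes how much the RMSE drops when the weight $\gamma_0$ is included. The dictionary layout and the helper name `best_correction` are illustrative choices, not part of the original experiments.

```python
# RMSE values transcribed from Table IV (binary tree structure).
table_iv = {
    8:  {"S1": 4.1255, "S2": 4.1338, "S1^0": 4.1220, "S2^0": 4.1285},
    16: {"S1": 4.1296, "S2": 4.1378, "S1^0": 4.1261, "S2^0": 4.1332},
}


def best_correction(row):
    """Return the name of the neighbor correction with the smallest RMSE."""
    return min(row, key=row.get)


for R, row in table_iv.items():
    name = best_correction(row)
    # RMSE reduction obtained by including the weight gamma_0.
    gain_s1 = row["S1"] - row["S1^0"]
    gain_s2 = row["S2"] - row["S2^0"]
    print(f"R = {R}: best = {name} ({row[name]:.4f}), "
          f"gain from gamma_0: {gain_s1:.4f} (S1), {gain_s2:.4f} (S2)")
```

The gains are small in absolute terms, which is consistent with the remark that the estimation is robust to the choice of neighbor correction.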

Fig. 5: MAE validation and test curves for toroid group structure
(horizontal axis: similarity parameter $\beta$) using the optimal
neighborhood size $r = 4$, regularization
weight $\kappa = 1/2^{10}$, forgetting factor $\rho = 1/2^5$, mini-batch size
$R = 8$, and similarity parameter $\beta = 3.4$. The applied
neighbor correction was $S_1^0$.

TABLE V: Performance (MAE) of the OSDL prediction using
toroid group structure ($\mathcal{G}$) with different neighbor sizes $r$
($r = 0$: unstructured case). First-second row: mini-batch size
$R = 8$, third-fourth row: $R = 16$. Odd rows: $S_1^0$, even rows:
$S_2^0$ neighbor correction. For fixed $R$, the best performance is
highlighted with boldface typesetting.

|          |         | $r = 0$ | $r = 1$ | $r = 2$ | $r = 3$ | $r = 4$    |
|----------|---------|---------|---------|---------|---------|------------|
| $R = 8$  | $S_1^0$ | 3.2225  | 3.2019  | 3.1989  | 3.1563  | **3.1544** |
|          | $S_2^0$ | 3.2371  | 3.2151  | 3.2085  | 3.1584  | 3.1571     |
| $R = 16$ | $S_1^0$ | 3.2220  | 3.1988  | 3.1982  | 3.1576  | **3.1546** |
|          | $S_2^0$ | 3.2382  | 3.2147  | 3.2101  | 3.1594  | 3.1568     |

When we used the MAE performance measure, our results
were similar to those of the RMSE. We got the best results
using toroid group structure, thus we present more details for
this case.

• With the usage of structured dictionaries we can get better
results: the estimation errors decreased as we
increased the neighbor size $r$ up to 4 (Table V). The
validation and test surfaces/curves are very similar, see
Fig. 4(e)-(f) and Fig. 5.

• The quality of the estimation depends mostly on the
regularization parameter $\kappa$ (Fig. 4(a)-(e)). The applied
forgetting factor $\rho$, mini-batch size $R$ and neighbor
correction method can help in fine-tuning the results, see
Fig. 4(a)-(e), Table V and Table VI, respectively.

• The smallest MAE we could achieve was 3.1544, using
neighbor size $r = 4$, $S_1^0$ neighbor correction and mini-batch
size $R = 8$. The baseline methods achieved MAE = 3.1616 [39]
and MAE = 3.1606 [2]. Our approach
outperformed both of these state-of-the-art competitors. We
also repeated this experiment 5 more times with different
randomly selected training, test, and validation sets, and
our MAE results were never worse than 3.155. This
demonstrates the efficiency of our approach.
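To make the role of the neighbor size $r$ in Table V more concrete, the following minimal sketch builds neighborhood groups on a two-dimensional grid with wrap-around (toroid) edges. It rests on several assumptions that this section does not confirm: the grid side length (a hypothetical 10 in the usage lines), the use of the wrap-around Manhattan distance to define neighborhoods, and the function name `toroid_groups`. With $r = 0$ every group is a singleton, which corresponds to the unstructured case.

```python
def toroid_groups(side, r):
    """Neighborhood groups on a side x side grid with wrap-around edges.

    Assumption: the group of cell (i, j) contains every cell whose
    wrap-around Manhattan distance from (i, j) is at most r.
    Cells are indexed as i * side + j.
    """
    def wrap_dist(a, b):
        # Distance along one axis, taking the shorter way around the torus.
        d = abs(a - b)
        return min(d, side - d)

    groups = []
    for i in range(side):
        for j in range(side):
            group = [
                u * side + v
                for u in range(side)
                for v in range(side)
                if wrap_dist(i, u) + wrap_dist(j, v) <= r
            ]
            groups.append(group)
    return groups


# Group sizes on a hypothetical 10 x 10 grid for the studied neighbor sizes.
for r in range(5):
    sizes = {len(g) for g in toroid_groups(10, r)}
    print(f"r = {r}: group sizes {sorted(sizes)}")
```

Because of the wrap-around, every cell has a group of the same size, so no dictionary element is treated differently for lying near a border.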

V. Conclusions 

We have dealt with collaborative filtering (CF) based recommender
systems and extended the application domain of
structured dictionaries to CF. We used online group-structured
dictionary learning (OSDL) to solve the CF problem; we
cast the CF estimation task as an OSDL problem. We
demonstrated the applicability of our novel approach on joke
recommendations. Our extensive numerical experiments show
that structured dictionaries have several advantages over the
state-of-the-art CF methods: more precise estimation can be
obtained, and a lower-dimensional feature representation can
be sufficient when group-structured dictionaries are applied.
Moreover, the estimation behaves robustly as a function of the
OSDL parameters and the applied group structure.

Fig. 3: RMSE validation surfaces [(a)-(b), (e)-(f)] and test surface (c) as a function of forgetting factor ($\rho$) and regularization
($\kappa$). (d): validation and test curve as a function of the similarity parameter ($\beta$), using the optimal number of levels $l = 4$, regularization weight $\kappa$ = [...], forgetting factor
$\rho = 0$, mini-batch size $R = 8$, similarity parameter $\beta = 1.8$. Group structure ($\mathcal{G}$): complete binary tree. Neighbor correction: $S_1^0$.
(a)-(c), (e)-(f): for a fixed $(\kappa, \rho)$ parameter pair, the surfaces show the best RMSE values optimized in the similarity parameter $\beta$.
(a): $l = 3$. (b)-(c): $l = 4$, on the same scale. (e): $l = 5$. (f): $l = 6$.

TABLE VI: Performance (MAE) of the OSDL prediction for
different neighbor corrections using toroid group structure ($\mathcal{G}$).
Columns: applied neighbor corrections. Rows: mini-batch size
$R = 8$ and $16$. The neighbor size was set to $r = 4$. For fixed $R$,
the best performance is highlighted with boldface typesetting.

|          | $S_1$  | $S_2$  | $S_1^0$    | $S_2^0$ |
|----------|--------|--------|------------|---------|
| $R = 8$  | 3.1719 | 3.1779 | **3.1544** | 3.1571  |
| $R = 16$ | 3.1726 | 3.1778 | **3.1546** | 3.1568  |